This dataset has a total of 10 features and 53940 observations.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import io
import requests
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("seaborn")
df_name = 'diamonds.csv'
df_url = 'https://raw.githubusercontent.com/akmand/datasets/main/diamonds.csv'
# Keep TLS certificate verification enabled (the default) when downloading the file
url_content = requests.get(df_url).content
df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))
df.sample(10, random_state=999)
| | carat | cut | color | clarity | depth | table | x | y | z | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 38848 | 0.40 | Ideal | I | IF | 62.2 | 56.0 | 4.75 | 4.71 | 2.94 | 1050 |
| 9023 | 1.04 | Ideal | H | SI1 | 61.9 | 57.0 | 6.49 | 6.46 | 4.01 | 4515 |
| 51799 | 0.75 | Premium | D | SI2 | 60.6 | 56.0 | 5.94 | 5.90 | 3.59 | 2415 |
| 35562 | 0.35 | Premium | G | VS1 | 61.2 | 58.0 | 4.54 | 4.51 | 2.77 | 906 |
| 18923 | 1.49 | Very Good | G | SI2 | 62.5 | 58.0 | 7.20 | 7.26 | 4.52 | 7773 |
| 53847 | 0.71 | Ideal | H | VVS1 | 60.8 | 56.0 | 5.75 | 5.83 | 3.52 | 2741 |
| 848 | 0.72 | Ideal | H | VVS2 | 60.9 | 57.0 | 5.79 | 5.77 | 3.52 | 2869 |
| 9756 | 0.90 | Premium | G | VS1 | 62.7 | 58.0 | 6.06 | 6.15 | 3.83 | 4661 |
| 15655 | 1.04 | Premium | F | VS2 | 59.6 | 62.0 | 6.62 | 6.56 | 3.93 | 6278 |
| 3696 | 0.72 | Very Good | G | VVS2 | 60.1 | 60.0 | 5.79 | 5.82 | 3.49 | 3449 |
from tabulate import tabulate
table = [
['Name', 'Data Type', 'Units', 'Descriptions'],
['Carat','Numerical','Carat','Weight of the diamond'],
['Cut','Ordinal Categorical','NA','Quality of the cut'],
['Color','Ordinal Categorical','NA','Colour of the diamond'],
['Clarity','Ordinal Categorical','NA','Measurement of how clear the diamond is'],
['Depth','Numerical','Percentage','Total depth percentage\n(Calculated by dividing the diamond’s total height by its total width)'],
['Table','Numerical','Percentage', 'Width of top of diamond relative to widest point\n(Calculated by dividing the width of the table by the total width of the diamond)'],
['x','Numerical','Millimetre','Length of diamond in millimetre'],
['y','Numerical','Millimetre','Width of diamond in millimetre'],
['z','Numerical','Millimetre','Depth of diamond in millimetre'],
['Price','Numerical','USD','Price of diamond in US Dollars'],
]
print(tabulate(table, headers='firstrow', tablefmt='fancy_grid'))
| Name | Data Type | Units | Descriptions |
|---|---|---|---|
| Carat | Numerical | Carat | Weight of the diamond |
| Cut | Ordinal Categorical | NA | Quality of the cut |
| Color | Ordinal Categorical | NA | Colour of the diamond |
| Clarity | Ordinal Categorical | NA | Measurement of how clear the diamond is |
| Depth | Numerical | Percentage | Total depth percentage (Calculated by dividing the diamond’s total height by its total width) |
| Table | Numerical | Percentage | Width of top of diamond relative to widest point (Calculated by dividing the width of the table by the total width of the diamond) |
| x | Numerical | Millimetre | Length of diamond in millimetre |
| y | Numerical | Millimetre | Width of diamond in millimetre |
| z | Numerical | Millimetre | Depth of diamond in millimetre |
| Price | Numerical | USD | Price of diamond in US Dollars |
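The depth percentage described in the table can be checked by hand against the sampled rows. The sketch below assumes the definition usually quoted for this dataset, depth = z / mean(x, y), expressed as a percentage. The function name `depth_percentage` is introduced here for illustration only.

```python
def depth_percentage(x_mm, y_mm, z_mm):
    """Depth (z) as a percentage of the mean of length (x) and width (y)."""
    return 100 * 2 * z_mm / (x_mm + y_mm)

# Row 38848 of the sample: x = 4.75, y = 4.71, z = 2.94, recorded depth = 62.2
print(round(depth_percentage(4.75, 4.71, 2.94), 1))

# Row 9023 of the sample: x = 6.49, y = 6.46, z = 4.01, recorded depth = 61.9
print(round(depth_percentage(6.49, 6.46, 4.01), 1))
```

Both results round to the depth values recorded in the sample, which suggests the depth column is consistent with the x, y, and z measurements under this assumed definition.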
Thus, the main goal of this model is to predict a diamond's price in US dollars from its available features. The purpose of this Phase 1 report is to clean and process the data, and to explore and visualise it with charts and graphs to gain insight into the relationships between the variables in the dataset. Based on these relationships, we can then determine which features appear to be the best predictors of diamond price.
There are a total of 1889 outlier diamonds in this dataset (see the code below for the calculation). These outliers are diamonds whose carat lies more than 1.5 times the interquartile range above the upper quartile or below the lower quartile. Figure 1 below is a box plot that visualises the five-number summary (minimum, lower quartile, median, upper quartile, maximum) together with the outliers present in the dataset.
print(df['carat'].describe())
quantile_1 = df['carat'].quantile(0.25)
quantile_3 = df['carat'].quantile(0.75)
iqr = quantile_3 - quantile_1
upper_whisker = quantile_3 + (1.5*iqr)
lower_whisker = quantile_1 - (1.5*iqr)
print('Upper Whisker: ', upper_whisker)
print('Lower Whisker: ', lower_whisker)
print('Number of Diamond Outliers: ', df[(df['carat'] < lower_whisker) | (df['carat'] > upper_whisker)]['carat'].shape[0])
count    53940.000000
mean         0.797940
std          0.474011
min          0.200000
25%          0.400000
50%          0.700000
75%          1.040000
max          5.010000
Name: carat, dtype: float64
Upper Whisker:  2.0
Lower Whisker:  -0.5599999999999999
Number of Diamond Outliers:  1889
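The whisker calculation can also be wrapped in a small reusable helper. This is a minimal sketch rather than part of the original analysis: the function name `iqr_fences` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy carat values, not taken from the diamonds dataset
toy = pd.Series([0.3, 0.4, 0.5, 0.7, 0.9, 1.0, 1.1, 3.5])
lower, upper = iqr_fences(toy)

# Values outside the fences are flagged as outliers
outliers = toy[(toy < lower) | (toy > upper)]
print(lower, upper, outliers.tolist())
```

For the carat column itself, the same arithmetic gives an interquartile range of 1.04 - 0.40 = 0.64, so the upper whisker is 1.04 + 1.5 × 0.64 = 2.0 and the lower whisker is 0.40 - 1.5 × 0.64 = -0.56, in line with the printed output.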
plt.figure(figsize = (20,10))
# Pass notch and sym by keyword so the call does not depend on positional order
bp0 = plt.boxplot(df['carat'], notch=False, sym='red', patch_artist=True)
plt.title('Figure 1: Carat of Diamond (Outliers Included)', fontsize = 20)
plt.ylabel('Weight (Carat)')
for box in bp0['boxes']:
    box.set(color='red', linewidth=1)
    box.set(facecolor='cyan')
plt.show()
Next, we remove the outliers by keeping only the rows whose "carat" value lies within the whiskers.
df = df[(df['carat'] <= upper_whisker) & (df['carat'] >= lower_whisker)]
df.shape
(52051, 10)
Hence, the total number of records is now 52051 after the 1889 outlier records have been removed. Figure 2 below displays the box plot of the dataset, which now has no carat values outside the original whiskers.
plt.figure(figsize = (20,10))
# Pass notch and sym by keyword so the call does not depend on positional order
bp0 = plt.boxplot(df['carat'], notch=False, sym='red', patch_artist=True)
plt.title('Figure 2: Carat of Diamond (Outliers Excluded)', fontsize = 20)
plt.ylabel('Weight (Carat)')
for box in bp0['boxes']:
    box.set(color='red', linewidth=1)
    box.set(facecolor='cyan')
plt.show()
print(f"\nNumber of missing values for each column:")
print(df.isnull().sum())
Number of missing values for each column:
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
x          0
y          0
z          0
price      0
dtype: int64
No adjustments to the data types are required because each feature already has the type we expect.
print(df.dtypes)
carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
x          float64
y          float64
z          float64
price        int64
dtype: object
As the data has more than 5000 rows, we randomly sample 5000 of the remaining 52051 rows for ease of computation. Finally, we display 5 random rows from the cleaned data.
df = df.sample(n=5000, random_state=999)
df.shape
df.sample(5, random_state=999)
| | carat | cut | color | clarity | depth | table | x | y | z | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 3677 | 0.70 | Good | F | VVS2 | 62.5 | 58.0 | 5.68 | 5.75 | 3.57 | 3445 |
| 46558 | 0.51 | Very Good | F | VS2 | 60.5 | 63.0 | 5.17 | 5.11 | 3.11 | 1781 |
| 16590 | 1.20 | Premium | H | VS1 | 61.3 | 58.0 | 6.85 | 6.81 | 4.19 | 6626 |
| 45199 | 0.51 | Ideal | E | VS2 | 62.6 | 55.0 | 5.08 | 5.11 | 3.19 | 1656 |
| 45123 | 0.77 | Fair | D | SI2 | 65.1 | 63.0 | 5.71 | 5.65 | 3.70 | 1651 |
Figure 3 below counts the diamonds in each cut quality, ordered from most to least frequent. "Ideal" is the most common cut in the dataset, whereas "Fair" is the least common. The figure also shows that "Premium" and "Very Good" cuts occur about equally often, each with a frequency between 1000 and 1500.
plt.figure(figsize = (20,5))
# Cut is categorical, so draw one bar per category, ordered by descending frequency
fig = sns.countplot(x = 'cut', data = df, order = df['cut'].value_counts().index)
fig = plt.title('Figure 3: Count of Diamonds by Cut Quality', fontsize = 20)
plt.xlabel('Cut Quality', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.show()
A diamond's color is graded on a letter scale from "D" to "Z", with "D" being the best and "Z" being the worst. A diamond graded "D", "E", or "F" is considered excellent and colorless, and it appears cold to the naked eye. The "G", "H", "I", and "J" grades, on the other hand, are considered "near colorless". The bar chart below displays the number of diamonds for each color grade in the dataset.
plt.figure(figsize = (20,5))
fig = sns.countplot(x = 'color', data = df, palette = None, order = df['color'].value_counts().index)
fig = plt.title('Figure 4: Barplot of Colour Grades Count in Diamonds', fontsize = 20)
plt.xlabel('Color Grades', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.show()
All diamonds have imperfections, but a diamond's overall clarity is determined by the quantity of imperfections, combined with the type and position of inclusions. Crystals, pinpricks, needles, and other objects are examples of inclusion types. Either white carbon or black carbon can be an inclusion. Diamonds are categorised by a clarity grade, which ranges from flawless (FL) to included (I). The number of diamonds for each clarity grade in the dataset is shown in Figure 5 below.
plt.figure(figsize = (20,5))
fig = sns.countplot(x = 'clarity', data = df, palette = 'deep', order = df['clarity'].value_counts().index)
fig = plt.title('Figure 5: Amount of Diamond Clarity Grades ', fontsize = 20)
plt.xlabel('Clarity Grades', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.show()
The diamond clarity grades, from best to worst, are as follows: FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.
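Because clarity is an ordinal variable, it can help to store it as an ordered categorical so that sorting, comparisons, and plots follow the grading scale rather than frequency or alphabetical order. The following is a minimal sketch, assuming the eight clarity levels that appear in this dataset. The `toy` frame is hypothetical and stands in for the real data.

```python
import pandas as pd

# Clarity levels that occur in this dataset, ordered from worst to best
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

# Hypothetical toy frame standing in for the diamonds data
toy = pd.DataFrame({'clarity': ['VS1', 'IF', 'SI2', 'I1', 'VVS2']})
toy['clarity'] = pd.Categorical(toy['clarity'], categories=clarity_order, ordered=True)

# Sorting now follows the grading scale instead of alphabetical order
sorted_grades = toy.sort_values('clarity')['clarity'].tolist()
print(sorted_grades)

# Ordered categoricals also support comparisons against a grade
n_vs1_or_better = int((toy['clarity'] >= 'VS1').sum())
print(n_vs1_or_better)
```

The same list could be passed as the `order` argument of `sns.countplot` to arrange the bars by grade instead of by count.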
Carat, also called carat weight, refers to how much a diamond weighs rather than how big it is. The violin plot below depicts the distribution of carat weight in this dataset. As the chart shows, there are more lighter diamonds than heavier ones.
plt.figure(figsize = (15,8))
# violinplot takes no kde or bins arguments; pass the data by keyword
sns.violinplot(x=df['carat']).set_title('Figure 6: Violinplot of Carat', fontsize = 20)
plt.xlabel('Carat Weight', fontsize = 15)
plt.ylabel('Density', fontsize = 15)
plt.show()
The histogram below illustrates the distribution of diamond prices to determine the most typical price range in this dataset. The distribution is right-skewed (positively skewed), meaning that more diamonds fall into the lower price range than the higher price range. The chart also features a single prominent peak (unimodal) in the lower price range, more precisely below $2,500. In addition, the shape of the histogram indicates that the mean price is higher than the median price.
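The link between right skew and a mean above the median can be checked numerically. The sketch below uses a hypothetical log-normal sample in place of the real price column, so the numbers it prints are illustrative only. On the actual data, the same three calls could be applied to `df['price']`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (log-normal), standing in for df['price']
rng = np.random.default_rng(999)
prices = pd.Series(rng.lognormal(mean=7.5, sigma=0.9, size=5000))

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # a positive value indicates a longer right tail

print(f"mean={mean_price:.0f}, median={median_price:.0f}, skew={skewness:.2f}")
```

A long right tail pulls the mean upward while leaving the median largely unaffected, which is why the mean exceeds the median in a positively skewed distribution.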
plt.figure(figsize = (15,8))
sns.histplot(df['price'], kde=True, bins=50).set_title('Figure 7: Histogram of Price', fontsize = 20)
plt.xlabel('Price (USD)', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.show()
Figure 8 shows how a diamond's cut might affect its price. "Premium" diamonds tend to be the most expensive of all the cuts, followed by "Very Good", "Good", "Fair", and lastly "Ideal", based on the lower end of their price ranges.
plt.figure(figsize = (20,7))
sns.boxplot(data = df, x="price", y="cut", palette='Paired')
plt.title('Figure 8: Boxplot of Price by Diamond Cut Quality', fontsize = 20)
plt.xlabel('Price (USD)', fontsize = 15)
plt.ylabel('Cut Quality', fontsize = 15)
plt.show()
Figure 9 below shows how a diamond's colour grade might affect its price. Grade "J" generally has a higher price than the other grades and the highest median price, while grade "E" has the lowest median price. The chart also reveals that the median prices for grades "F" and "G" are almost the same, at about $2,500.
plt.figure(figsize = (20,7))
sns.boxenplot(data = df, x="color", y="price", palette='Paired')
plt.title('Figure 9: Boxenplot of Price by Diamond Colour Grade', fontsize = 20)
plt.xlabel('Colour', fontsize = 15)
plt.ylabel('Price (USD)', fontsize = 15)
plt.show()
Figure 10 shows the relationship between a diamond's price and its carat. The two variables have a strong positive correlation (0.9), meaning that price tends to rise as carat rises. To validate that Figure 10 shows a strong positive correlation, we calculated the Pearson correlation coefficient and display the result below.
Pearson Correlation Coefficient:
from scipy.stats import pearsonr
corr, _= pearsonr(df['carat'], df['price'])
print('Pearson Correlation Coefficient: ', corr)
Pearson Correlation Coefficient: 0.903461781960776
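For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand and compares it with NumPy's `np.corrcoef`. The helper name `pearson_r` and the toy values are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson r: sum of co-deviations over the root of the product of squared deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

# Hypothetical toy values, not taken from the diamonds dataset
carat = [0.3, 0.5, 0.7, 1.0, 1.5]
price = [400, 1200, 2500, 5000, 9000]

manual = pearson_r(carat, price)
library = np.corrcoef(carat, price)[0, 1]
print(manual, library)
```

The two values should agree up to floating-point error. A coefficient close to 1 indicates a strong positive linear relationship, but it does not by itself show that the relationship is linear over the whole range.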
plt.figure(figsize = (15,8))
plt.scatter(df['carat'], df['price'], alpha = 0.3, color="orange")
plt.title('Figure 10: Scatterplot of Price by Carat', fontsize = 20)
plt.xlabel('Carat (Weight)', fontsize=15)
plt.ylabel('Price (USD)', fontsize=15)
plt.show()
Figure 11 examines the relationship between the price of diamonds and their length (x), revealing a strong positive correlation (0.86) between the two variables, which implies that price tends to rise as length rises. To validate that Figure 11 shows a strong positive correlation, we computed the Pearson correlation coefficient for these two variables.
Pearson Correlation Coefficient:
from scipy.stats import pearsonr
# Correlate price with length (column 'x') rather than carat
corr, _ = pearsonr(df['x'], df['price'])
print('Pearson Correlation Coefficient: ', corr)
plt.figure(figsize = (15,8))
# Plot length (column 'x') on the horizontal axis to match the figure title
plt.scatter(df['x'], df['price'], alpha = 0.3, color="blue")
plt.title('Figure 11: Scatterplot of Price by Length', fontsize = 20)
plt.xlabel('Length (mm)', fontsize=15)
plt.ylabel('Price (USD)', fontsize=15)
plt.show()
Figure 12 shows diamond prices by clarity. SI2 diamonds have the highest median price, while IF diamonds have the lowest. The VS2, VS1, SI1, and SI2 grades have the highest upper quartile prices. By contrast, I1, VVS2, VVS1, and IF have the lowest upper quartile prices.
plt.figure(figsize = (20,7))
sns.boxplot(data = df, x="clarity", y="price", palette='Paired')
plt.title('Figure 12: Boxplot of Price by Clarity', fontsize = 20)
plt.xlabel('Clarity', fontsize = 15)
plt.ylabel('Price (USD)', fontsize = 15)
plt.show()
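The medians and upper quartiles read off a box plot can also be tabulated directly. This is a minimal sketch on a hypothetical toy frame, so the prices are invented for illustration. On the real data, the same chain could be applied to `df`.

```python
import pandas as pd

# Hypothetical toy rows standing in for the sampled diamonds DataFrame
toy = pd.DataFrame({
    'clarity': ['SI2', 'SI2', 'SI2', 'IF', 'IF', 'IF'],
    'price':   [3000, 4000, 5000, 900, 1000, 1400],
})

# Median and upper quartile (75th percentile) of price for each clarity grade
summary = (
    toy.groupby('clarity')['price']
       .quantile([0.5, 0.75])
       .unstack()
       .rename(columns={0.5: 'median', 0.75: 'upper_quartile'})
       .sort_values('median', ascending=False)
)
print(summary)
```

Sorting by the median puts the most expensive grade first, which makes it easier to compare the table with the ordering described for the box plot.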
plt.figure(figsize = (15,8))
fig_4 = sns.barplot(x ='cut', y ='price', hue = 'color', data = df, palette='Set3')
plt.title('Figure 13: Barplot of Diamond Prices by Cut Quality and Colour', fontsize = 20)
plt.xlabel('Quality Cut', fontsize = 15)
plt.ylabel('Diamond Price', fontsize = 15)
plt.show()
plt.figure(figsize = (25, 10))
# Recent seaborn versions require x and y to be passed by keyword
sns.scatterplot(x = 'carat', y = 'price', hue = 'clarity', data = df)
plt.title('Figure 14: Scatterplot of Price by Carat coloured by Clarity', fontsize = 20)
plt.xlabel('Carat', fontsize = 15)
plt.ylabel('Price (USD)', fontsize = 15)
plt.legend(loc = 'upper left')
plt.show()
plt.figure(figsize = (25,10))
# Recent seaborn versions require x and y to be passed by keyword
sns.barplot(x = 'color', y = 'price', hue = 'clarity', data = df)
plt.title('Figure 15: Barplot of Colour Grade by Price, in terms of Clarity', fontsize = 20)
plt.xlabel('Colour', fontsize = 20)
plt.ylabel('Price (USD)', fontsize = 20)
plt.legend(loc = 'upper right', fontsize = 20)
plt.show()
plt.figure(figsize = (25, 10))
# Recent seaborn versions require x and y to be passed by keyword
sns.stripplot(x = 'cut', y = 'price', hue = 'color', data = df)
plt.title('Figure 16: Strip Plot of Cut Vs Price, in terms of Colour', fontsize = 20)
plt.xlabel('Cut', fontsize = 20)
plt.ylabel('Price', fontsize = 20)
plt.legend(loc = 'upper right', fontsize = 20)
plt.show()
plt.figure(figsize = (25, 10))
# Recent seaborn versions require x and y to be passed by keyword
sns.scatterplot(x = 'carat', y = 'price', hue = 'cut', data = df)
plt.title('Figure 17: Scatter Plot of Carat Vs Price, in terms of Cut', fontsize = 20)
plt.xlabel('Carat', fontsize = 20)
plt.ylabel('Price', fontsize = 20)
plt.legend(loc = 'upper left', fontsize = 20)
plt.show()
Mankind's fascination with diamonds is well documented, and diamonds still captivate people today. Diamond is the only precious stone made entirely of carbon, and in its pure form it is transparent and colourless. Since diamonds are the hardest gemstones known to man, only other diamonds can scratch them. Diamonds are created about 100 miles below the surface of the Earth, in the upper mantle. The growth of diamond crystals requires a combination of high temperature and high pressure, and the pressure comes from the weight of the overlying rock pressing down.
Diamonds are very scarce because they require incredibly strong forces to form, and they are therefore considered very expensive. Diamond pricing is a complicated process influenced by a number of variables, including carat, cut, colour, and clarity. This report examines and visualises the relationship between these variables and price.
The diamonds.csv dataset file contains about 54K observations (53940) and 10 variables: carat, cut, colour, clarity, depth, table, price, x (length in mm), y (width in mm), and z (depth in mm). Overall, it is a tidy dataset without any errors or missing values.
Summary of Data Description
To uncover the patterns concealed behind the numbers, we conducted an exploratory data analysis on the diamonds dataset. First, we split the variables into two groups: categorical and numerical. According to our calculations and analysis of the correlations between the quantitative variables, the target variable price has a strong positive correlation with carat, x, y, and z.
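One way to rank the numerical variables by their correlation with price is a correlation matrix. The sketch below runs on hypothetical synthetic data in which price is driven mainly by carat and depth is unrelated. The column values and coefficients are assumptions for illustration and do not reproduce the report's figures.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: price driven mainly by carat, depth unrelated
rng = np.random.default_rng(999)
carat = rng.uniform(0.2, 2.0, size=500)
toy = pd.DataFrame({
    'carat': carat,
    'x': 6.5 * np.cbrt(carat) + rng.normal(0, 0.05, size=500),
    'depth': rng.normal(61.7, 1.4, size=500),
    'price': 4000 * carat ** 1.5 + rng.normal(0, 300, size=500),
})

# Pearson correlation of every numerical column with the target, strongest first
price_corr = toy.corr()['price'].drop('price').sort_values(ascending=False)
print(price_corr)
```

On the real DataFrame, the categorical columns would need to be excluded first, for example with `df.select_dtypes('number')`, before calling `corr()`.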
We also plotted price against each categorical variable and examined the patterns. Carat, x, y, and z, in that order, appear to be the most important factors in determining a diamond's price. Therefore, the greater the weight (carat) and size, the more expensive the diamond will be. Clarity, colour, and cut are other deciding factors, although they are not as important as those discussed above. Depth and table appear to be the least significant factors for determining a diamond's price. Although visualising the data gives us valuable insights into the relationships between the diamond's variables, further investigation using statistical modelling is required in Phase 2 to fully understand the characteristics of the data for predicting the diamond's price.